Complete the following steps, using data on cities and towns in the US, area_data.Rds
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom 0.7.9 ✓ rsample 0.1.0
## ✓ dials 0.0.10 ✓ tune 0.1.6
## ✓ infer 1.0.0 ✓ workflows 0.2.4
## ✓ modeldata 0.1.1 ✓ workflowsets 0.1.0
## ✓ parsnip 0.1.7 ✓ yardstick 0.0.8
## ✓ recipes 0.1.17
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(scales)
ad<-read_rds("area_data.Rds")
split_data<-initial_split(ad)
ad_train<-training(split_data)
ad_test<-testing(split_data)
perc_homeown (dependent variable) as a function of college education college_educ and comment on what you see.ad%>%
ggplot(aes(x=perc_homeown))+
geom_density()
ad%>%
ggplot(aes(x=college_educ))+
geom_density()
gg<-ad%>%
ggplot(aes(y=perc_homeown,
x=college_educ,
text=paste(name,
"<br>",
"Homeownership:", percent(perc_homeown/100,accuracy = 1),
"<br>",
"% College Educated:", percent(college_educ/100,accuracy=1))))+
geom_point()
ggplotly(gg,tooltip ="text")
Modest negative relationship between percent college educated and percent owning their own homes.
ad_formula<-as.formula("perc_homeown~college_educ")
ad_rec<-recipe(ad_formula,ad_train)
lm_fit<-linear_reg()%>%
set_engine("lm")%>%
set_mode("regression")
ad_wf<-workflow()%>%
add_recipe(ad_rec)%>%
add_model(lm_fit)
ad_wf<-ad_wf%>%
fit(ad_train)
ad_wf%>%tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 72.1 0.711 101. 0
## 2 college_educ -0.119 0.0277 -4.32 0.0000181
For a one unit change in the percent of the population with a college degree, home ownership is predicted to decline by .12 percentage points.
ad_lf<-ad_wf%>%last_fit(split_data)
ad_lf$.metrics
## [[1]]
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 6.44 Preprocessor1_Model1
## 2 rsq standard 0.0284 Preprocessor1_Model1
The predictions from the model including only college education are off by about 6.9 percentage points on average.
ad_formula<-as.formula("perc_homeown~college_educ+income_75")
ad_rec<-recipe(ad_formula,ad_train)
ad_wf<-workflow()%>%
add_recipe(ad_rec)%>%
add_model(lm_fit)
ad_wf<-ad_wf%>%
fit(ad_train)
ad_wf%>%tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 66.5 1.00 66.4 4.86e-302
## 2 college_educ -0.308 0.0363 -8.50 1.16e- 16
## 3 income_75 0.298 0.0389 7.66 6.41e- 14
There’s a positive relationship between income and home ownership. For a one unit increase in the percent of people making over 75,000, home ownership is predicted to increase by .3 percentage points, even after controlling for percent college educated.
ad_lf<-ad_wf%>%last_fit(split_data)
ad_lf$.metrics
## [[1]]
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 6.29 Preprocessor1_Model1
## 2 rsq standard 0.0727 Preprocessor1_Model1
The accuracy of the model is slightly better, but the predictions in the testing dataset are still off by about 6.3 percentage points.
division) to your model and again repeat steps 3-5.Levels of division
ad%>%group_by(division)%>%count()
## # A tibble: 9 × 2
## # Groups: division [9]
## division n
## <fct> <int>
## 1 East North Central 159
## 2 West North Central 120
## 3 Mid-Atlantic 66
## 4 New England 26
## 5 East South Central 95
## 6 South Atlantic 153
## 7 West South Central 131
## 8 Mountain 94
## 9 Pacific 82
ad_formula<-as.formula("perc_homeown~college_educ+income_75+division")
ad_rec<-recipe(ad_formula,ad_train)%>%
step_dummy(division)
ad_wf<-workflow()%>%
add_recipe(ad_rec)%>%
add_model(lm_fit)
ad_wf<-ad_wf%>%
fit(ad_train)
ad_wf%>%tidy()
## # A tibble: 11 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 69.7 1.08 64.3 5.78e-292
## 2 college_educ -0.333 0.0330 -10.1 1.87e- 22
## 3 income_75 0.314 0.0374 8.39 2.89e- 16
## 4 division_West.North.Central -1.38 0.769 -1.80 7.26e- 2
## 5 division_Mid.Atlantic -0.421 0.951 -0.443 6.58e- 1
## 6 division_New.England 1.00 1.40 0.715 4.75e- 1
## 7 division_East.South.Central -3.71 0.837 -4.43 1.12e- 5
## 8 division_South.Atlantic -4.05 0.724 -5.59 3.22e- 8
## 9 division_West.South.Central -5.43 0.775 -7.01 5.71e- 12
## 10 division_Mountain -2.77 0.823 -3.36 8.12e- 4
## 11 division_Pacific -10.6 0.876 -12.1 6.93e- 31
The percent of people owning their own homes is lower in almost all census divisions than in the reference category of East North Central. Homeonwership rates in Mid Atlantic and in New England are not observably different than in the East North Central Division.
ad_lf<-ad_wf%>%last_fit(split_data)
ad_lf$.metrics
## [[1]]
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 5.49 Preprocessor1_Model1
## 2 rsq standard 0.295 Preprocessor1_Model1
Including census division results in a more accurate model. The rmse is now 5.2, indicating that the predictions in the testing dataset are off by about 5.2 percentage points.
This data comes from the American Community Survey of 2019. It covers all of the metro or micro statistical areas in the United States. It includes characteristics of these areas, include education, income, home ownership and others as described below.
| Name | Description |
|---|---|
| name | Name of Micro/Metro Area |
| college_educ | Percent of population with at least a bachelor’s degree |
| perc_commute_30p | Percent of population with commute to work of 30 minutes or more |
| perc_insured | Percent of population with health insurance |
| perc_homeown | Percent of housing units owned by occupier |
| geoid | Geographic FIPS Code (id) |
| income_75 | Percent of population with income over 75,000 |
| perc_moved_in | Percent of population that moved from another state in last year |
| perc_in_labor force | Percent of population in labor force |
| metro | Metropolitan Area? Yes/No |
| state | State Abbreviation |
| region | Census Region |
| division | Census Division |